Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

position_deletes metadata table #1615

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

amitgilad3
Copy link
Contributor

Implements position_deletes metadata table - #1053

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally LGTM! i added a few nit comments on style.

I think its a good idea to introduce the PositionalDelete class, similar to DataFile

@@ -657,3 +717,19 @@ def all_manifests(self) -> "pa.Table":
lambda args: self._generate_manifests_table(*args), [(snapshot, True) for snapshot in snapshots]
)
return pa.concat_tables(manifests_by_snapshots)

def position_deletes(self) -> "pa.Table":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for any metadata tables referencing the current snapshot, let's add an optional param to allow querying for any given snapshot id.


from pyiceberg.io.pyarrow import schema_to_pyarrow

partition_record = self.tbl.metadata.specs_struct()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: specs_struct() looks at all PartitionSpecs, do we just need the current partition spec here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also nit: naming, this is a structtype, not the underlying record

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch , i now take the current PartitionSpec

partition_record = self.tbl.metadata.specs_struct()
pa_partition_struct = schema_to_pyarrow(partition_record)
pa_row_struct = schema_to_pyarrow(self.tbl.schema().as_struct())
positinal_delete_schema = pa.schema(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
positinal_delete_schema = pa.schema(
positional_delete_schema = pa.schema(

pa.field("delete_file_path", pa.string(), nullable=False),
]
)
return positinal_delete_schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return positinal_delete_schema
return positional_delete_schema

Comment on lines 892 to 895
def _read_delete_file(fs: FileSystem, data_file: DataFile, schema: "pa.Schema") -> pa.Table:
delete_fragment = _construct_fragment(fs, data_file, file_format_kwargs={"pre_buffer": True, "buffer_size": ONE_MEGABYTE})
table = ds.Scanner.from_fragment(fragment=delete_fragment, schema=schema).to_table()
return table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this might be a good place to start introducing a PositionalDelete class and we can get rid of the custom _get_positional_file_schema function

@kevinjqliu
Copy link
Contributor

also lets add this to new table to the docs as well https://py.iceberg.apache.org/api/#inspecting-tables

@amitgilad3
Copy link
Contributor Author

done :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants